Goto

Collaborating Authors

 Suriname



Global health's defining test

Al Jazeera

As we look back on 2025, the world experienced a year of both remarkable achievement and profound challenge in global health. Multilateralism, science and solidarity were tested as never before, underscoring a fundamental truth: International cooperation is not optional. It is essential if we are to protect and promote health for everyone, everywhere in 2026 and beyond. Perhaps the most significant milestone was the adoption by WHO Member States of the Pandemic Agreement, a landmark step towards making the world safer from future pandemics. Alongside this, amendments to the International Health Regulations came into force, including a new "pandemic emergency" alert level designed to trigger stronger global cooperation.


Optimal Lower Bounds for Online Multicalibration

Collina, Natalie, Lu, Jiuyao, Noarov, Georgy, Roth, Aaron

arXiv.org Machine Learning

We prove tight lower bounds for online multicalibration, establishing an information-theoretic separation from marginal calibration. In the general setting where group functions can depend on both context and the learner's predictions, we prove an $Ω(T^{2/3})$ lower bound on expected multicalibration error using just three disjoint binary groups. This matches the upper bounds of Noarov et al. (2025) up to logarithmic factors and exceeds the $O(T^{2/3-\varepsilon})$ upper bound for marginal calibration (Dagan et al., 2025), thereby separating the two problems. We then turn to lower bounds for the more difficult case of group functions that may depend on context but not on the learner's predictions. In this case, we establish an $\widetildeΩ(T^{2/3})$ lower bound for online multicalibration via a $Θ(T)$-sized group family constructed using orthogonal function systems, again matching upper bounds up to logarithmic factors.


Don't Throw Away Your Beams: Improving Consistency-based Uncertainties in LLMs via Beam Search

Fadeeva, Ekaterina, Goloburda, Maiya, Rubashevskii, Aleksandr, Vashurin, Roman, Shelmanov, Artem, Nakov, Preslav, Sachan, Mrinmaya, Panov, Maxim

arXiv.org Machine Learning

Consistency-based methods have emerged as an effective approach to uncertainty quantification (UQ) in large language models. These methods typically rely on several generations obtained via multinomial sampling, measuring their agreement level. However, in short-form QA, multinomial sampling is prone to producing duplicates due to peaked distributions, and its stochasticity introduces considerable variance in uncertainty estimates across runs. We introduce a new family of methods that employ beam search to generate candidates for consistency-based UQ, yielding improved performance and reduced variance compared to multinomial sampling. We also provide a theoretical lower bound on the beam set probability mass under which beam search achieves a smaller error than multinomial sampling. We empirically evaluate our approach on six QA datasets and find that its consistent improvements over multinomial sampling lead to state-of-the-art UQ performance.


Data-Driven Learnability Transition of Measurement-Induced Entanglement

Qian, Dongheng, Wang, Jing

arXiv.org Artificial Intelligence

Measurement-induced entanglement (MIE) captures how local measurements generate long-range quantum correlations and drive dynamical phase transitions in many-body systems. Yet estimating MIE experimentally remains challenging: direct evaluation requires extensive post-selection over measurement outcomes, raising the question of whether MIE is accessible with only polynomial resources. We address this challenge by reframing MIE detection as a data-driven learning problem that assumes no prior knowledge of state preparation. Using measurement records alone, we train a neural network in a self-supervised manner to predict the uncertainty metric for MIE--the gap between upper and lower bounds of the average post-measurement bipartite entanglement. Applied to random circuits with one-dimensional all-to-all connectivity and two-dimensional nearest-neighbor coupling, our method reveals a learnability transition with increasing circuit depth: below a threshold, the uncertainty is small and decreases with polynomial measurement data and model parameters, while above it the uncertainty remains large despite increasing resources. We further verify this transition experimentally on current noisy quantum devices, demonstrating its robustness to realistic noise. These results highlight the power of data-driven approaches for learning MIE and delineate the practical limits of its classical learnability.


Democratic or Authoritarian? Probing a New Dimension of Political Biases in Large Language Models

Piedrahita, David Guzman, Strauss, Irene, Schölkopf, Bernhard, Mihalcea, Rada, Jin, Zhijing

arXiv.org Artificial Intelligence

As Large Language Models (LLMs) become increasingly integrated into everyday life and information ecosystems, concerns about their implicit biases continue to persist. While prior work has primarily examined socio-demographic and left--right political dimensions, little attention has been paid to how LLMs align with broader geopolitical value systems, particularly the democracy--authoritarianism spectrum. In this paper, we propose a novel methodology to assess such alignment, combining (1) the F-scale, a psychometric tool for measuring authoritarian tendencies, (2) FavScore, a newly introduced metric for evaluating model favorability toward world leaders, and (3) role-model probing to assess which figures are cited as general role-models by LLMs. We find that LLMs generally favor democratic values and leaders, but exhibit increased favorability toward authoritarian figures when prompted in Mandarin. Further, models are found to often cite authoritarian figures as role models, even outside explicit political contexts. These results shed light on ways LLMs may reflect and potentially reinforce global political ideologies, highlighting the importance of evaluating bias beyond conventional socio-political axes. Our code is available at: https://github.com/irenestrauss/Democratic-Authoritarian-Bias-LLMs.


Dejavu: Towards Experience Feedback Learning for Embodied Intelligence

Wu, Shaokai, Ji, Yanbiao, Li, Qiuchang, Zhang, Zhiyi, He, Qichen, Xie, Wenyuan, Zhang, Guodong, Bayramli, Bayram, Ding, Yue, Lu, Hongtao

arXiv.org Artificial Intelligence

Embodied agents face a fundamental limitation: once deployed in real-world environments to perform specific tasks, they are unable to acquire additional knowledge to enhance task performance. In this paper, we propose a general post-deployment learning framework Dejavu, which employs an Experience Feedback Network (EFN) and augments the frozen Vision-Language-Action (VLA) policy with retrieved execution memories. EFN identifies contextually prior action experiences and conditions action prediction on this retrieved guidance. We adopt reinforcement learning with semantic similarity rewards to train EFN, ensuring that the predicted actions align with past behaviors under current observations. During deployment, EFN continually enriches its memory with new trajectories, enabling the agent to exhibit "learning from experience". Experiments across diverse embodied tasks show that EFN improves adaptability, robustness, and success rates over frozen baselines. We provide code and demo in our supplementary material.


HUME: Measuring the Human-Model Performance Gap in Text Embedding Tasks

Assadi, Adnan El, Chung, Isaac, Solomatin, Roman, Muennighoff, Niklas, Enevoldsen, Kenneth

arXiv.org Artificial Intelligence

Comparing human and model performance offers a valuable perspective for understanding the strengths and limitations of embedding models, highlighting where they succeed and where they fail to capture meaning and nuance. However, such comparisons are rarely made, as human performance on embedding tasks is difficult to measure. To fill this gap, we introduce HUME: Human Evaluation Framework for Text Embeddings. While frameworks like MTEB provide broad model evaluation, they lack reliable estimates of human performance, limiting the interpretability of model scores. We measure human performance across 16 MTEB datasets spanning reranking, classification, clustering, and semantic textual similarity across linguistically diverse high- and low-resource languages. Humans achieve an average performance of 77.6% compared to 80.1% for the best embedding model, though with substantial variation: models reach high performance on some datasets while struggling on notably low-resource languages. Our human annotations also reveal multiple dataset issues. We additionally benchmark nine LLMs as annotators on reranking, classification, and STS tasks, finding that they fall short of human performance (76.1% vs. 81.2%) despite offering scalability advantages. We provide human performance baselines, insights into task difficulty patterns, and an extensible evaluation framework that enables a more meaningful interpretation of results and informs the development of both models and benchmarks. Our code, dataset, and leaderboard are publicly available at https://github.com/embeddings-benchmark/mteb.


The World Cup draw is here - this is how it will work

BBC News

Pots, quadrants, confederation constraints, group position grids... the 2026 World Cup finals draw on Friday is not going to be a straightforward affair. There's a lot to unpack so we're going to explain it as simply as we can. Luckily, Fifa will have a computer to do most of the heavy lifting and make sure everything runs smoothly. Though as Uefa found out in 2021, sometimes technology does go wrong. Let's hope there will be no gremlins in Washington once the draw ceremony kicks off.


Unleashing Hour-Scale Video Training for Long Video-Language Understanding

Lin, Jingyang, Wu, Jialian, Sun, Ximeng, Wang, Ze, Liu, Jiang, Su, Yusheng, Yu, Xiaodong, Chen, Hao, Luo, Jiebo, Liu, Zicheng, Barsoum, Emad

arXiv.org Artificial Intelligence

Recent long-form video-language understanding benchmarks have driven progress in video large multimodal models (Video-LMMs). However, the scarcity of well-annotated long videos has left the training of hour-long Video-LMMs underexplored. To close this gap, we present VideoMarathon, a large-scale hour-long video instruction-following dataset. This dataset includes around 9,700 hours of long videos sourced from diverse domains, ranging from 3 to 60 minutes per video. Specifically, it contains 3.3M high-quality QA pairs, spanning six fundamental topics: temporality, spatiality, object, action, scene, and event. Compared to existing video instruction datasets, VideoMarathon significantly extends training video durations up to 1 hour, and supports 22 diverse tasks requiring both short- and long-term video comprehension. Building on VideoMarathon, we propose Hour-LLaVA, a powerful and efficient Video-LMM for hour-scale video-language modeling. It enables hour-long video training and inference at 1-FPS sampling by leveraging a memory augmentation module, which adaptively integrates question-relevant and spatiotemporally informative semantics from the cached full video context. In our experiments, Hour-LLaVA achieves the best performance on multiple representative long video-language benchmarks, demonstrating the high quality of the VideoMarathon dataset and the superiority of the Hour-LLaVA model.